An effective ensemble learning approach for classification of glioma grades based on novel MRI features

The preoperative diagnosis of brain tumors is important for therapeutic planning, as it contributes to the tumors' prognosis. In the last few years, developments in artificial intelligence and machine learning have contributed greatly to the medical area, especially the diagnosis of brain tumor grades from radiological and magnetic resonance images. Due to the complexity of tumor descriptors in medical images, assessing the accurate grade of a glioma is a major challenge for physicians. We propose a new classification system for glioma grading that integrates novel MRI features with an ensemble learning method, called Ensemble Learning based on Adaptive Power Mean Combiner (EL-APMC). We evaluate and compare the performance of the EL-APMC algorithm with twenty-one classifier models that represent state-of-the-art machine learning algorithms. Results show that the EL-APMC algorithm achieved the best performance in terms of classification accuracy (88.73%) and F1-score (93.12%) on the MRI Brain Tumor dataset BRATS2015. In addition, we show that the differences in classification results among the twenty-two classifier models are statistically significant. We believe that the EL-APMC algorithm is an effective method for classification on small datasets, which are common in medical fields. The proposed method provides an effective system for the classification of glioma with high reliability and accurate clinical findings.

Developing a classification model for glioma grades based on a statistical analysis of tumor descriptors achieves an accurate differentiation between glioma grades, which assists physicians in distinguishing them.
Many factors and MRI characteristics can be used in the clinical center for brain tumor diagnosis, for example, the analysis of necrosis, edema, enhancement of non-enhanced MRI tumors, and the different MRI features that appear after tumor enhancement6. Furthermore, determining the malignancy grade of a glioma depends on the specialist's experience and level of qualification. Diagnosing an MRI brain tumor by visual examination through magnetic resonance image analysis may take a long time, as it requires strong experience for the diagnosis to be accurate9. In addition, even with enhanced MRI protocols and the development of this industry, grading glioma by visual diagnosis remains difficult10. Therefore, in an attempt to enhance the sensitivity and quality of classification methods, we propose novel MRI features integrated with an effective ensemble learning method.
Using machine learning for the detection and classification of malignant brain tumors poses several challenges4,5. For example, the limited availability of brain tumor images can affect the performance and generalization ability of models. In addition, machine learning models are prone to overfitting, especially when dealing with limited data. Data imbalance is another challenge: the distribution of different types of brain tumors in datasets can be highly imbalanced, which can lead to biased models and difficulty in accurately detecting malignant tumors. Brain tumors also exhibit considerable heterogeneity in size, shape, texture, and location, so the selected feature extraction method and machine learning models need to be robust enough to classify tumors accurately despite these variations. Addressing these challenges requires a careful design that takes into consideration all the challenges mentioned above.
To address these challenges, we propose an automated classification model based on a novel feature extraction method integrated with an effective machine learning algorithm called EL-APMC11. EL-APMC is built on an ensemble of base classifiers that are adaptively combined to maximize classification results. This structure offers several benefits; for example, incorporating more classifiers can effectively reduce overfitting and improve generalization performance on unseen data. In addition, EL-APMC is trained using bootstrap bagging without replacement, which can mitigate the effects of class imbalance. Unlike other ensemble learning methods that use fixed fusion rules, EL-APMC uses an adaptive fusion method called the Power Mean Combiner (PMC), which is trained to match the data statistics and thereby maximize classification accuracy. We also use the subspace training method to maximize independence among the base classifiers and improve diversity, which brings benefits such as improved accuracy, robustness, and reduced overfitting. As a result, the EL-APMC algorithm is a promising technique for classifying small datasets, a common problem in the medical domain. The effectiveness of the EL-APMC algorithm is compared with twenty-one state-of-the-art machine learning methods. The findings indicate that the EL-APMC algorithm demonstrated notable performance in both classification accuracy (88.73%) and F1-score (93.12%) when evaluated on the BRATS2015 MRI Brain Tumor dataset. In addition, this work investigates the effectiveness of the proposed MRI features for the classification of glioma grades.
The rest of the paper is organized as follows. Recent literature related to brain tumor classification is introduced in Section II. Section III discusses the impact of dataset size on classification performance. Section IV reviews the feature extraction method and the working principles of the EL-APMC algorithm. Section V discusses the results. Finally, Section VI gives the main conclusions and future directions.

Related work
Various machine learning methods have been used and proposed in recent years to classify brain tumors, as shown in Table 1. In the last few years, the use of machine learning and applications of AI have increased rapidly, and many researchers have proposed different classification methods. In12, MRI glioma grades were classified into three grades (II, III, and IV); the classification system used Gabor texture as input features with SVM as the classification model, and the results show a classification accuracy of 88%. The work in13 proposed a classification system based on statistical MRI features and K-means clustering to differentiate low-grade from high-grade MRI brain tumors and achieved a classification accuracy of 80.40%. Similarly, MRI images have been classified into two classes (normal and abnormal)14. That model consists of several phases, starting with enhancement of the brain MRI images using the Shift-Invariant Shearlet Transform (SIST); the Gabor Grey Level Co-occurrence Matrix (GLCM) and Discrete Wavelet Transform (DWT) were then used for the feature extraction phase, and finally the selected features were fed to a feed-forward backpropagation neural network, obtaining an accuracy rate of 99.8%. Hsieh et al.15 suggested a classification model using logistic regression to classify low grades against high grades based on Local Binary Pattern (LBP) texture features and achieved a classification accuracy of 93%. Deep learning based on CNNs has also been proposed to classify MRI glioma grades16, accomplishing a classification accuracy of 91.16%. Shree et al.17 proposed a brain tumor classification model for binary classification (normal and abnormal); they used GLCM for feature extraction and a PNN classifier, which resulted in 95% classification accuracy. The mean intensities of the MR regions were used to produce a classification system for glioma grades with SVM as the classification method18, and the obtained classification accuracy was 93%. Likewise,19 proposed a hybrid energy-efficient method for automatic tumor detection and segmentation; the developed method consists of seven phases and achieves 98% accuracy. In20, a two-stage ensemble learning approach was proposed to classify three glioma grades (Grade II, Grade III, and Grade IV). The number of subjects used in the study is 135 (90 patients and 45 controls), and five characteristics are used in classification, which are human telomerase reverse transcriptase (hTERT), chitinase-like protein (YKL-40), interleukin 6 (IL-6), tissue inhibitor of metalloproteinase-1 (TIMP-1), and the neutrophil/lymphocyte ratio (NLR). The authors claimed to achieve better classification accuracy than state-of-the-art machine learning classifiers.
The work given in21 used anisotropic noise removal filtering, GLCM for feature selection, and an SVM classifier to identify the tumor region in brain MRI images; according to their results, they can localize tumor regions with 98% accuracy. Rajeev et al.22 investigated a hybrid deep learning approach for brain tumor classification using an improved Gabor wavelet transform and a BiLSTM network. The experiments were based on a public, open-source Kaggle dataset that includes four directories: glioma-tumor, meningioma-tumor, no-tumor, and pituitary-tumor. The proposed methods were implemented on the MATLAB platform, and the highest accuracy achieved was 98.4%. An automated system for the segmentation of MRI brain tumors has been accomplished based on the combination of an Interval Type-II fuzzy logic system and an artificial bee colony algorithm to identify tumor regions23. The developed algorithm was investigated using image sequences available in the BRATS challenge datasets (2015, 2017, and 2018), and the researchers claimed to achieve 96% classification results in terms of the Dice Overlap Index (DOI). A summary of the classification models and features used for the classification of glioma grades is shown in Table 1.

Impact of dataset size on classification performance
In this section, we review the challenges of training machine learning models on small datasets and investigate the machine learning algorithms that most effectively target this issue. Datasets play a pivotal role in modern healthcare services, for example in personalized medicine and automated diagnosis24. The size of the data is a crucial factor in determining the performance of a machine learning algorithm: in practice, a small data size leads to overfitting problems, while a large data size leads to better classification results25–27.
Data collection in the medical area faces many obstacles, such as rare medical conditions and medical organizations' privacy constraints. Deep learning algorithms provide good results in different applications; however, obtaining accurate results with a deep learning algorithm requires training it on a large amount of data, which in some cases is not available28. In addition, training machine learning algorithms on large datasets requires a considerable amount of time and computational resources that may not be available in certain circumstances.
Many efforts in the literature have tried to define the size of a small dataset, but there is no clear definition. For example, Shawe-Taylor et al.29 presented the Probably Approximately Correct (PAC) framework, which specifies the minimum number of samples needed to achieve a desired accuracy. The work in30 proposed an algorithm based on information theory for defining a minimum data size, while another work31 examined different studies that dealt with small data sizes in order to define a range for small dataset sizes.
Training a machine learning algorithm on a small dataset is a challenging task, since the data does not represent the actual data distribution, which may lead to overfitting. In an overfitting situation, the classification algorithm performs well on training data and poorly on testing data; in other words, the algorithm fits the training data well, but the training data does not represent the actual data distribution. In this case, the trained model does not generalize well and produces unreliable and biased classification results. Increasing classification accuracy on limited data is a challenging research area. To address this problem, some of the literature has focused on increasing the accuracy of classification algorithms on limited-size datasets, while other works have investigated the effect of dataset size on the performance of classification algorithms32,33. In this work, EL-APMC as well as state-of-the-art machine learning methods are investigated to tackle the problem of classification on limited data sizes. In the following sections, we review our proposal, compare its performance against state-of-the-art machine learning algorithms, and evaluate the comparison across different classification metrics.

Proposed method
In this section, we review the MRI descriptors of brain tumors that are used to extract eight novel features. We then describe the structure of the EL-APMC algorithm, which is used to develop an automated classification system for glioma grades.

Feature extraction
In this experimental work, a standard labeled dataset, BRATS2015 34, was used to evaluate the proposed approach. This dataset includes a label identification layer, which is used to generate four masks that individually isolate the labeled regions: necrosis, edema, non-enhanced tumor, and enhanced tumor. Visualizations of these brain tumor descriptors show the distinct recognized regions of a brain tumor, extracted from contrast-enhanced T1 images, as shown in Fig. 1. The presence of each tumor descriptor is measured by the number of pixels within the corresponding labeled region. A search process determines the total number of pixels in each region across all slices; this procedure is carried out for every patient in the dataset, and the results are then averaged per patient. The following equation is used to determine the four resulting MRI features:

$$\mathrm{Name\_M} = \frac{1}{z}\sum_{k=1}^{z}\sum_{x}\sum_{y}\big[\mathrm{SEG}(x,y,k)=\ell\big] \tag{1}$$

The average presence of a tumor descriptor, denoted Name_M, is calculated from the label identification layer (SEG) provided by the dataset, where ℓ is the label of the descriptor of interest, [·] is the indicator function, z is the total number of MRI slices that contain a tumor, and x and y are the coordinates within each MRI slice. Name_M takes the values tC_M, tnC_M, Edm_M, and Nec_M, which are the average presence of contrast enhancement, non-enhancement, edema, and necrosis, respectively. An additional four novel features are extracted and included in the classification process: the resultant ratios tC_R, tnC_R, Edm_R, and Nec_R of tumor enhancement, non-enhancement, edema, and necrosis, respectively, which are derived from the averages in (1) by normalizing each average by the total over the four descriptors.
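To make the computation concrete, the following minimal Python sketch (the original implementation was in MATLAB) derives the eight features from a 3-D SEG label volume. The label values follow the usual BRATS2015 convention (1 = necrosis, 2 = edema, 3 = non-enhancing tumor, 4 = enhancing tumor), and the normalization used for the ratio features is an assumption consistent with the description above.

```python
import numpy as np

# Assumed BRATS2015 label convention:
# 1 = necrosis, 2 = edema, 3 = non-enhancing tumor, 4 = enhancing tumor.
LABELS = {"Nec_M": 1, "Edm_M": 2, "tnC_M": 3, "tC_M": 4}

def extract_features(seg):
    """seg: 3-D integer array (x, y, slice) taken from the SEG label layer."""
    # z in Eq. (1): only slices that contain tumor voxels are counted.
    tumor_slices = [s for s in range(seg.shape[2]) if (seg[:, :, s] > 0).any()]
    z = len(tumor_slices)
    feats = {}
    for name, label in LABELS.items():
        # Average presence: pixel count of this label, averaged per tumor slice.
        count = sum(int((seg[:, :, s] == label).sum()) for s in tumor_slices)
        feats[name] = count / z if z else 0.0
    # Ratio features (assumed form): each average over the four-descriptor total.
    total = sum(feats.values())
    for name in list(feats):
        feats[name.replace("_M", "_R")] = feats[name] / total if total else 0.0
    return feats
```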
The EL-APMC algorithm
EL-APMC is a classification method proposed in11 that belongs to the family of ensemble learning methods.
In that work, a theoretical framework was developed to understand how the fusion methods of ensemble learning systems interact with the base classifiers; based on the theoretical results, a new adaptive classification method was proposed and achieved notable results against several fusion methods. In the present work, we investigate the classification accuracy of the EL-APMC algorithm and compare it with state-of-the-art machine learning algorithms in the case of limited dataset size. The fusion method used in EL-APMC is called the power mean combiner (PMC) and is defined as follows:

$$f_{\alpha}(k_1, k_2, \ldots, k_N) = \left(\frac{1}{N}\sum_{i=1}^{N} k_i^{\alpha}\right)^{1/\alpha}, \quad -\infty < \alpha < \infty \tag{6}$$

where k_1, k_2, …, k_N are positive real numbers that represent the base classifier outputs and α is a real number that selects the aggregation behavior of f_α(·). PMC thus spans an infinite family of fusion operations, recovering the arithmetic mean (α = 1), the geometric mean (α → 0), the harmonic mean (α = −1), and more. However, it is unclear a priori why certain fusion methods work better than others for a given classification task. Fortunately, because PMC aggregates this infinite family of fusion functions, we can search for an optimal member that minimizes the classification error.
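The PMC itself is a one-line computation. The sketch below (an illustration, not the authors' MATLAB code) evaluates f_α for positive base classifier outputs and handles the α → 0 limit, which is the geometric mean.

```python
import numpy as np

def power_mean_combiner(scores, alpha):
    """Power mean f_alpha of Eq. (6) over positive base-classifier outputs.
    scores: array of shape (N,) or (N, n_samples), one row per classifier."""
    k = np.asarray(scores, dtype=float)
    if np.isclose(alpha, 0.0):
        return np.exp(np.mean(np.log(k), axis=0))  # alpha -> 0: geometric mean
    return np.mean(k ** alpha, axis=0) ** (1.0 / alpha)

outputs = np.array([0.9, 0.6, 0.7])        # toy posteriors from 3 classifiers
print(power_mean_combiner(outputs, 1.0))   # arithmetic mean (alpha = 1)
print(power_mean_combiner(outputs, -1.0))  # harmonic mean  (alpha = -1)
```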
The working principle of EL-APMC is described as follows. The ensemble setup consists of two main phases: training and testing. During the training phase, a fivefold cross-validation approach is employed. In each fold, the data are pre-processed before being used to train the individual classifiers. The goal of the pre-processing stage is to introduce diversity among the base classifiers. To achieve this, we employ two well-known methods, namely bagging and random subspace. The combination of bagging and subspace techniques enhances randomness and minimizes the generalization error at the decision combiner stage. Bagging utilizes the bootstrap, a technique that generates multiple bootstrap samples from the original training dataset to train the individual base learners within the ensemble. Each bootstrap sample is created by randomly sampling observations from the original dataset without replacement, resulting in multiple distinct subsets. These subsets are then used to train each base learner independently, which improves diversity among the base learners; this is crucial for enhancing the overall performance and robustness of the ensemble model. By training base learners on different subsets of the data, bootstrapping reduces the risk of overfitting and helps capture different aspects of the underlying data distribution.
The bootstrap process generates N subsets, and each generated subset is divided into two equal parts: one of In-Bag (InBag) samples and one of Out-of-Bag (OutBag) samples. The InBag portion is utilized to train the N base classifiers, while the OutBag samples are used to estimate the individual classifier weights, which are later used in the decision combination process. Additionally, all the OutBag replicas are aggregated and utilized to train the PMC. This setup offers the advantage of eliminating the need for additional data to train the PMC, as the OutBag samples serve this purpose. Bagging with bootstrap aggregation is considered a regularization technique that reduces overfitting and improves generalization performance. Another method used to control regularization is early stopping, which prevents overfitting by halting the training process when performance on a validation set starts to degrade. In our proposal, we can control the number of base classifiers (N) used in the ensemble to prevent it from becoming overly complex.
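A minimal sketch of this bagging stage is shown below. The equal InBag/OutBag split follows the description above, while the use of a full random permutation of the data per bag is an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def make_bags(n_samples, n_classifiers):
    """Bootstrap-without-replacement bagging: N subsets, each split equally
    into InBag (trains the base classifier) and OutBag (estimates classifier
    weights and, aggregated across bags, trains the PMC)."""
    bags = []
    for _ in range(n_classifiers):
        idx = rng.permutation(n_samples)       # sampling without replacement
        half = n_samples // 2
        bags.append((idx[:half], idx[half:]))  # (InBag, OutBag) index pairs
    return bags
```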
The second method employed to enhance diversity is the random subspace technique. Instead of using the entire feature set to train each base model, a random subset of the features is selected for each model, and each base model is then trained on its corresponding feature subset. This improves the performance and robustness of ensemble learning, particularly in scenarios where overfitting is a concern or where datasets have high dimensionality. Using bootstrap bagging and random subspace training, together with thorough hyperparameter tuning, can mitigate overfitting in ensemble learning and improve the predictive performance of the model.
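Random subspace selection can be sketched in a few lines; the square-root rule for the subset size is described next, and the example feature count here is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def random_subspace(n_features):
    """Pick a random feature subset of size sqrt(n_features) (square-root
    rule) for one base model in the ensemble."""
    m_r = max(1, int(round(np.sqrt(n_features))))
    return rng.choice(n_features, size=m_r, replace=False)

# Example: with 8 MRI features, each base classifier sees about 3 of them.
print(random_subspace(8))
```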
The number of features used is determined by taking the square root (m_r) of the total number of predictors generated from the bootstrap sampling. In the final stage of training, the aggregated replicas of the OutBag samples are employed to train the PMC. The approach used to implement the PMC is called Adaptive PMC with Threshold Estimation (APMCT). This method involves estimating the probability density functions (pdfs) of the classes together with an optimal threshold; an adaptive algorithm is utilized to estimate the prior and posterior probabilities of the combiner. For the two-class case, the optimal threshold is determined by minimizing the classification error:

$$P_e(\mu) = P(w_1)\big[1 - F_1(\mu)\big] + P(w_2)\,F_2(\mu) \tag{7}$$

where P_e is the classification error, P(w_j), j = 1, 2 are the class prior probabilities, F_j(·), j = 1, 2 is the cumulative distribution function of class w_j, and μ_opt is the optimal threshold. F_j(·) is estimated using the histogram technique. Since F_j(·) is computed from the PMC outputs, P_e depends on both α and μ. During the training phase, the EL-APMC algorithm therefore minimizes P_e jointly over the combiner parameter and the threshold:

$$(\alpha_{opt}, \mu_{opt}) = \arg\min_{\alpha,\,\mu} P_e(\alpha, \mu) \tag{8}$$

There are many optimization algorithms available to solve (8); among these, we used surrogate optimization. This refers to a method in which a surrogate model is employed to approximate the behavior of a complex, computationally expensive, or difficult-to-evaluate objective function. Instead of directly evaluating the objective function, which might involve time-consuming simulations or expensive experiments, the surrogate model is used as a proxy to guide the optimization process. The surrogate model is iteratively updated based on a limited set of evaluations of the true objective function and is then used to predict the objective function values at unexplored points in the search space; these predictions are used to select new points at which to evaluate the true objective function, improving the efficiency of the overall optimization process. The primary advantage of surrogate optimization is its ability to reduce the computational cost of optimization by replacing expensive function evaluations with inexpensive surrogate model predictions35. Using (8), α_opt and μ_opt are estimated and used to classify data in the test phase. The working principle of EL-APMC is summarized in Algorithm 1, and Fig. 2 illustrates the working principles of the EL-APMC algorithm.
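The sketch below illustrates the objective of (7)-(8) on OutBag data, reusing power_mean_combiner from the earlier sketch. A plain grid search stands in for the surrogate optimizer actually used, and the decision direction (class w1 declared below the threshold) and the grid ranges are assumptions.

```python
import numpy as np

def classification_error(out1, out2, p1, p2, alpha, mu):
    """Eq. (7) estimated from OutBag outputs: out1/out2 are (N, n_samples)
    base-classifier outputs for class w1/w2 validation data."""
    f1 = power_mean_combiner(out1, alpha)   # combined scores, class w1 data
    f2 = power_mean_combiner(out2, alpha)   # combined scores, class w2 data
    F1 = np.mean(f1 <= mu)                  # histogram-style CDF estimates
    F2 = np.mean(f2 <= mu)
    return p1 * (1.0 - F1) + p2 * F2

def fit_apmct(out1, out2, p1, p2):
    """Grid-search stand-in for Eq. (8): jointly pick alpha_opt and mu_opt."""
    best = min(((classification_error(out1, out2, p1, p2, a, m), a, m)
                for a in np.linspace(-5, 5, 101)
                for m in np.linspace(0, 1, 101)),
               key=lambda t: t[0])
    return best  # (P_e, alpha_opt, mu_opt)
```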

Result analysis and discussion
In this section, we investigate the effectiveness of integrating the proposed MRI features with different machine learning algorithms, as well as with the EL-APMC algorithm, for the classification of glioma grades based on the BRATS2015 dataset. The environment used for the classification is MATLAB36, since it provides various tools that support machine learning tasks. We evaluated the glioma grade classification dataset on twenty-one machine learning models available in MATLAB that represent the state of the art. As is known37, deep learning algorithms are effective mainly on large datasets and fail to achieve good performance on small ones; since our dataset contains about 275 instances, we did not include deep learning algorithms in the comparison. Table 2 shows the basic default parameter values for the 21 classifier models. The classification results are averaged over fivefold cross-validation, and the parameters used in training the EL-APMC algorithm are defined in Table 3.

Several evaluation metrics were measured, namely classification accuracy, recall, precision, and F1-score. These metrics are the most familiar tools used to measure the performance of a classification model, and all of them are derived from the confusion matrix defined in Table 4, where True Positive (TP) is the number of instances that the model predicts as positive and that are actually positive; False Negative (FN) is the number of instances predicted as negative that are actually positive; False Positive (FP) is the number of instances predicted as positive that are actually negative; and True Negative (TN) is the number of instances predicted as negative that are actually negative. The performance metrics are derived from the confusion matrix parameters (TP, FN, FP, and TN) as defined in (9)-(12):

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{9}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{10}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{11}$$

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{12}$$

The accuracy metric measures the ability of a model to identify the true positive and negative instances relative to the total number of instances; for an imbalanced dataset, accuracy can be misleading, since the majority class overwhelms the minority class. The recall metric captures how many of the actual positive instances are predicted as positive, which is beneficial when false negatives carry a high cost. The precision metric measures how accurate the model is when it predicts positive instances, i.e., how many of them are actually positive, which is beneficial when false positives carry a high cost. The F1-score is the harmonic mean of recall and precision; it is useful when both metrics matter and a balanced summary of the two is needed.

The classification results are evaluated against these four metrics over 22 classifier models. Table 5 shows the performance of the classification models ranked by classification accuracy in decreasing order.
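For completeness, a small helper computing (9)-(12) from the confusion-matrix counts is shown below; the counts in the example are placeholders, not values from Table 5.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision, and F1-score (Eqs. 9-12)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

print(metrics(tp=120, fn=10, fp=18, tn=70))  # placeholder counts
```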
Figure 3 shows a comparison among the different evaluation metrics using a box plot. The purpose of the comparison is to statistically summarize the performance of the classifier models across the evaluation metrics. As shown, the classifiers exhibit large variability in recall scores compared to the other metrics, while the variability is minimal for the F1-score; this is because the F1-score takes the harmonic mean of recall and precision, which reduces the variability. The average scores among the different metrics are 85.57%, 94.58%, 88.38%, and 91.31% for accuracy, recall, precision, and F1-score, respectively. Figures 4, 5, 6 and 7 visualize the results of Table 5 in terms of accuracy, recall, precision, and F1-score. As shown in Fig. 4, the EL-APMC algorithm achieved the best performance in terms of classification accuracy compared to the 21 classifier models under comparison, with logistic regression second and linear SVM third. Figure 5 shows the performance of the classifiers ranked in descending order of recall, where fine Gaussian SVM ranks first, coarse Gaussian SVM second, and the EL-APMC algorithm third; it is evident from these results that the variants of SVM classifiers work best on the recall metric. Figure 6 shows the classifier models ranked in descending order of precision, where ensemble RUS-boosted trees rank first, the decision tree second, kernel Naive Bayes third, and the EL-APMC algorithm eighth; it is clear from these results that the variants of decision tree classifiers work best on the precision metric.
The F1-score averages recall and precision and is considered a crucial metric for imbalanced datasets, where the accuracy metric can be misleading. Figure 7 shows the performance of the classifier models in terms of F1-score, ranked in descending order. The EL-APMC algorithm ranks first, followed by linear SVM in second and ensemble subspace discriminant in third place. In summary, the EL-APMC algorithm provides the best performance in terms of accuracy and F1-score, since its structure combines two strategies that help achieve these results.
First, instead of using a single classifier model, the EL-APMC model uses an ensemble of machine learning models, which improves classification performance. Second, unlike popular ensemble learning methods that use a fixed fusion rule, the EL-APMC model employs an adaptive fusion method, the Power Mean Combiner (PMC). During the training process of the EL-APMC algorithm, the PMC is trained to match the statistics of the base classifier outputs. By comparison, a standard ensemble learning algorithm such as ensemble bagged trees uses a fixed fusion method, the majority-voting rule; using a fixed fusion method in ensemble learning limits its predictive ability, especially in the case of limited dataset size.
To study the statistical significance of the results given in Table 5, we analyze the classification results (accuracy, recall, precision, and F1-score) in terms of the sample mean, sample standard deviation, and hypothesis tests. The purpose of the sample mean and standard deviation is to evaluate the overall performance of the machine learning algorithms. The results in Table 6 show that the classification models achieved, on average, good performance on the recall (94.44%) and F1-score (91.22%) metrics. There is a high standard deviation in the classification results in terms of recall (0.0355) and precision (0.027) compared to the accuracy and F1-score metrics; in other words, the classifier models exhibit high variability in recall and precision compared to the other metrics. The purpose of the hypothesis tests is to establish whether the differences in the classification results are statistically significant. For this purpose, we use the one-sample Kolmogorov-Smirnov test. The null hypothesis states that the classification results in terms of accuracy, recall, precision, and F1-score come from a specific distribution, versus the alternative hypothesis that the samples do not come from such a distribution, at a 5% significance level. The P-values shown in Table 6 are small for all metrics, i.e., less than 5%, which means the null hypothesis is rejected and the differences in the classification results are statistically significant.
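A sketch of the one-sample Kolmogorov-Smirnov test is given below, using SciPy rather than MATLAB; the accuracy values are placeholders rather than the Table 5 results, and the normal reference distribution is an assumption of the sketch.

```python
import numpy as np
from scipy import stats

# Placeholder accuracy scores (not the Table 5 values).
acc = np.array([0.887, 0.874, 0.862, 0.851, 0.843, 0.839, 0.828, 0.815])

# One-sample KS test against a normal distribution fitted to the sample.
stat, p_value = stats.kstest(acc, "norm", args=(acc.mean(), acc.std(ddof=1)))
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.4f}")
# p-value < 0.05 rejects the null hypothesis that the scores follow the
# hypothesized distribution at the 5% significance level.
```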
To estimate the complexity of EL-APMC compared to the other algorithms, we use the notation O(n), where n counts the loops, recursive calls, and other control structures of an algorithm. In the training phase, EL-APMC performs two main steps: training the N base classifiers and running the surrogate optimization algorithm. In comparison to the other machine learning algorithms used in this work, we can estimate the time complexity of

Figure 1. The MRI images of Grade IV glioma exhibit distinct characteristics in terms of the morphology of the brain tumor. These characteristics include the presence of tumor enhancement in the T1 images after contrast enhancement. The center of the tumor is marked by necrosis, while edema surrounds the tumor and is visible in the T2 images.

Figure 3. Box plot for the different evaluation metrics.

Figure 4. Ranking of classifier models according to their classification accuracy values.

Figure 5. Ranking of classifier models according to their recall values.

Figure 6. Ranking of classifier models according to their precision values.

Figure 7. Ranking of classifier models according to their F1-score values.

Table 1. Summary of works using various classification models for glioma grading.

Table 4. Details of the confusion matrix.

Table 5. Performance of classifier models in terms of accuracy, recall, precision, and F1-score.

Table 6. Statistical analysis of the results of the classification models.